Introduction

This report explores a data set of State of the Union speeches from George Washington (1790) to Barack Obama (2016). We wanted to explore several questions: how the speeches can be visualized, how they relate to Political Party and Presidential Rating, and how they evolve over time. We address these questions with several methods, including dimension reduction, sentiment analysis, word frequencies and cosine similarity. We start with visualizations that use different methods of dimension reduction to explore Political Party, Presidential Rating and Year of the speech.

image

Data Visualization

Text can be an extremely difficult data type to visualize. Within a document term matrix (DTM) there can be tens of thousands of words, associated with only a handful of documents. This creates a very sparse matrix that is hard to conceptualize. One way to combat the sparsity is to use dimension reduction techniques in order to see if relationships are more apparent in different representations of the data.
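To make the sparsity concrete, here is a minimal sketch of building a DTM from raw text. It is not the code used for this report: the function names `build_dtm` and `sparsity`, the whitespace tokenization and the three toy documents are all assumptions made for illustration.

```python
from collections import Counter


def build_dtm(documents):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


def sparsity(rows):
    """Fraction of cells in the matrix that are zero."""
    cells = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / cells


# Toy documents (hypothetical, not taken from the speeches).
docs = ["the union is strong", "the state of the union", "congress and the states"]
vocabulary, dtm = build_dtm(docs)
print(vocabulary)
print(round(sparsity(dtm), 3))
```

Even with three short documents more than half of the cells are zero; with tens of thousands of terms the proportion is far higher.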

We applied three dimension reduction methods to this data set: Principal Component Analysis (PCA), Multidimensional Scaling (MDS) and t-distributed Stochastic Neighbor Embedding (t-SNE). We compare these methods to determine whether the speeches form groups in these alternative representations.

All three of these forms used only the Document Term Matrix. In all cases we reduced the data down to two dimensions to simplify the resulting visualizations.
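As an illustration of reducing a DTM to two dimensions, the sketch below computes the first two principal components with power iteration and deflation, using only the standard library. This is an assumed, simplified stand-in for whatever PCA routine produced the figures; the names `pca_2d`, `leading_direction` and `toy_dtm` are hypothetical, and a numerical library would normally be used instead.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center_columns(rows):
    """Subtract each column's mean so every term has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def leading_direction(rows, iterations=200, seed=0):
    """Power iteration for the top eigenvector of X^T X, without forming it."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        scores = [dot(r, v) for r in rows]  # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project each row onto the first two principal components."""
    x = center_columns(rows)
    v1 = leading_direction(x, seed=0)
    pc1 = [dot(r, v1) for r in x]
    # Deflation: remove the first component so the next run finds the second.
    residual = [[a - s * b for a, b in zip(r, v1)] for r, s in zip(x, pc1)]
    v2 = leading_direction(residual, seed=1)
    pc2 = [dot(r, v2) for r in residual]
    return list(zip(pc1, pc2))


# Toy matrix: 4 documents, 3 terms (hypothetical counts).
toy_dtm = [[0, 1, 0], [4, 1, 0], [2, 2, 0], [2, 0, 0]]
coords = pca_2d(toy_dtm)
for point in coords:
    print(point)
```

The sign of each component is arbitrary, so a plot may appear mirrored between runs or implementations without the grouping being any different.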

The data contain one entry per speech, so a president who gave several speeches appears several times. We combine this analysis with other types of text processing and visualization to give a fuller overview of the data.

Political Party

First we look at Political Party to see whether parties speak in distinct ways in their State of the Union speeches. This seems plausible because American politics has long been divided between two ideals: power for the states and power for the federal government.
image


In this figure, PCA shows some clustering of the Democrats and the Republicans. We can also see the Federalist and Democratic-Republican speeches in a tight group in the lower left portion of the PCA plot. MDS shows groups similar to those in PCA, but t-SNE shows no visible difference between political parties.

We further explore this area by investigating the sentiment of presidential speeches by party. Here we only compare the modern Democratic and Republican parties which have existed in their current incarnations since 1852.

image

The sentiment score was calculated by looking up every word of a speech in two corpora, one of positive words and one of negative words, counting the positive matches and the negative matches, and then taking the difference between the positive count and the negative count.
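The calculation just described can be sketched as follows. The two tiny word sets are hypothetical stand-ins for the real positive and negative corpora, and the regular-expression tokenizer is an assumption; the report does not say how the speeches were tokenized.

```python
import re

# Hypothetical miniature lexicons; the real corpora are much larger.
POSITIVE = {"good", "strong", "peace", "prosperity"}
NEGATIVE = {"war", "crisis", "debt", "weak"}


def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Number of positive words minus number of negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_count = sum(1 for word in words if word in positive)
    negative_count = sum(1 for word in words if word in negative)
    return positive_count - negative_count


print(sentiment_score("Peace, prosperity, and a strong union after war."))
```

Because the score is a raw difference, longer speeches can reach larger absolute values; dividing by the number of words would be one way to compare speeches of different lengths.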

While there do not appear to be any inherent differences in the distributions of sentiment between the two parties, we note that the vast majority of speeches contain more positive words than negative ones.

We continue comparing the Democratic and Republican parties using 88 speeches by Republican presidents and 84 speeches by Democratic presidents.

Here are the top 30 words used in Republican speeches.

Word frequencies of Republican speeches

Here are the top 30 words used in Democratic speeches.

Word frequencies of Democratic speeches

As we can see above, the top 3 words for both Republicans and Democrats were “will”, “state” and “govern”. In addition, many of the remaining words, such as “congress” and “American”, were used frequently by both parties.

Word clouds let us see the similarities between the two parties visually.

Wordcloud of Republican Speakers

Wordcloud of Democrat Speakers

Next we measure the similarity of speeches between Democrats and Republicans using cosine similarity. Cosine similarity measures the similarity between two vectors as the cosine of the angle between them; for non-negative vectors such as word counts it ranges from 0 (no terms in common) to 1 (same direction).
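For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch applied to term-count vectors; the function name and the choice to return 0.0 for an all-zero vector are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an empty document
    return dot / (norm_u * norm_v)


# Two speeches with the same word proportions but different lengths.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))
# Two speeches with no words in common.
print(cosine_similarity([1, 0], [0, 1]))
```

Because only the direction of the vectors matters, a long speech and a short speech with the same word proportions score as highly similar.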

Top 30 cosine similarities

rank  speech 1 (president/party)   speech 2 (president/party)        cosine similarity
1     Martin van Buren/Democratic  Woodrow Wilson/Democratic         0.802355625239314
2     Martin van Buren/Democratic  Calvin Coolidge/Republican        0.797540100245719
3     Martin van Buren/Democratic  Ulysses S. Grant/Republican       0.795087373085944
4     Martin van Buren/Democratic  Andrew Johnson/Republican         0.782093075680344
5     Martin van Buren/Democratic  Ulysses S. Grant/Republican       0.772948334297628
6     Martin van Buren/Democratic  Andrew Johnson/Republican         0.76223006663571
7     Martin van Buren/Democratic  Ulysses S. Grant/Republican       0.761731659737625
8     Martin van Buren/Democratic  Abraham Lincoln/Republican        0.75576317138654
9     Martin van Buren/Democratic  Calvin Coolidge/Republican        0.754761050208218
10    Martin van Buren/Democratic  John F. Kennedy/Democratic        0.753969318552649
11    Martin van Buren/Democratic  Franklin D. Roosevelt/Democratic  0.749417310663843
12    Andrew Johnson/Republican    William McKinley/Republican       0.749120024980519
13    Martin van Buren/Democratic  Andrew Johnson/Republican         0.748283595459302
14    Martin van Buren/Democratic  Dwight D. Eisenhower/Republican   0.747932101586138
15    Martin van Buren/Democratic  Calvin Coolidge/Republican        0.747472988697713
16    Martin van Buren/Democratic  Ulysses S. Grant/Republican       0.747262497713423
17    Martin van Buren/Democratic  Harry S. Truman/Democratic        0.745340053039387
18    Martin van Buren/Democratic  Dwight D. Eisenhower/Republican   0.742942071173964
19    Martin van Buren/Democratic  Abraham Lincoln/Republican        0.741128327795825
20    Martin van Buren/Democratic  John F. Kennedy/Democratic        0.740829039193294
21    Martin van Buren/Democratic  Dwight D. Eisenhower/Republican   0.739460907831584
22    Andrew Johnson/Republican    William H. Taft/Republican        0.739336360549158
23    Martin van Buren/Democratic  Ulysses S. Grant/Republican       0.738564122261049
24    Martin van Buren/Democratic  Calvin Coolidge/Republican        0.7384688765084
25    Martin van Buren/Democratic  Ulysses S. Grant/Republican       0.737385498896977
26    Ulysses S. Grant/Republican  Theodore Roosevelt/Republican     0.737339619435886
27    Andrew Johnson/Republican    William H. Taft/Republican        0.7370184066313
28    Ulysses S. Grant/Republican  William H. Taft/Republican        0.736852828438062
29    Martin van Buren/Democratic  Calvin Coolidge/Republican        0.736781837865062
30    Martin van Buren/Democratic  Ulysses S. Grant/Republican       0.736105457804099

As we can see above, the top row pairs two Democratic speeches, while rows 2 through 9 all pair a Democratic speech with a Republican one. It is interesting that speeches by Martin Van Buren (Democratic Party) are highly similar to speeches by many other presidents. The average cosine similarity between two Democratic speeches is 0.4030646 and between two Republican speeches is 0.4113491, while the average between a Republican and a Democratic speech is 0.4405719. On average, then, speeches are no more similar within a party than across parties.
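Averages like these can be obtained by grouping every pair of speeches by the parties of the two speakers. The sketch below shows one way to do that; the function `mean_similarity_by_group`, the toy vectors and the labels "D" and "R" are hypothetical, and the report does not state exactly how its averages were computed.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity_by_group(vectors, labels):
    """Average pairwise cosine similarity for each unordered pair of labels."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of similarities, pair count]
    for i, j in combinations(range(len(vectors)), 2):
        key = tuple(sorted((labels[i], labels[j])))
        totals[key][0] += cosine_similarity(vectors[i], vectors[j])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Toy term-count vectors for four speeches (hypothetical).
vectors = [[1, 0], [1, 0], [0, 1], [0, 1]]
labels = ["D", "D", "R", "R"]
averages = mean_similarity_by_group(vectors, labels)
print(averages)
```

Each speech pair is counted once, and a speech is never compared with itself, which would otherwise inflate the within-party averages.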

Before running this analysis, we expected to see some difference between Democrats and Republicans in the content of their speeches. According to the bar plots and cosine similarities, the content does not differ substantially when we look at all speeches from 1790 to 2016.

Given that we did not find substantial differences in the aggregate, we were interested in how an individual president’s speeches change over time. We selected the two most recent presidents, President Barack Obama (Democrat) and President George W. Bush (Republican), and took a closer look at their word frequencies over their time in office, i.e. the 8 speeches of President Obama and the 9 speeches of President Bush.

image


Here are the top 10 most frequently used words in President Obama’s speeches. The plot shows that President Obama emphasizes american/america and job/work, reflecting an emphasis on the country and the job market.

Word frequencies of President Barack Obama

Here are the top 10 most frequently used words in President Bush’s speeches. Bush also frequently used words like “america”, “american” and “nation”, but he mentions jobs much less. His most frequent word is “applause”, which probably comes from audience-reaction markers in the transcripts rather than from the speeches themselves.

Word frequencies of President George W. Bush

To compare word use between President Obama and President Bush visually, we use the word clouds shown below.

Wordcloud of President Barack Obama

Wordcloud of President George W. Bush

Next we look at the top 10 most frequently used words for each year of each president’s tenure. We use a heat map to visualize the word frequencies by year. Note that the x-axis includes every word that was in the top 10 in at least one year of the tenure.
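The matrix behind such a heat map can be assembled as sketched below: one row per year and one column for every word that reaches the top k in at least one year. The function `heatmap_matrix`, the whitespace tokenization and the two toy speeches are assumptions for illustration, not the report's code.

```python
from collections import Counter


def heatmap_matrix(speeches_by_year, k=10):
    """Return (years, columns, matrix) with matrix[i][j] = count of columns[j] in years[i]."""
    counts = {year: Counter(text.lower().split()) for year, text in speeches_by_year.items()}
    # Union of each year's top-k words forms the x-axis.
    columns = sorted({word for c in counts.values() for word, _ in c.most_common(k)})
    years = sorted(counts)
    # Cells are filled from the full counts, not only from each year's top k.
    matrix = [[counts[year][word] for word in columns] for year in years]
    return years, columns, matrix


# Toy speeches (hypothetical text).
speeches = {
    2012: "tax tax tax job job america",
    2013: "job job job work work america",
}
years, columns, matrix = heatmap_matrix(speeches, k=2)
print(columns)
for year, row in zip(years, matrix):
    print(year, row)
```

Filling every cell from the full counts, as here, keeps a word from showing as zero merely because it fell outside that year's top k. Whether the report's heat maps were built this way is not stated.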

Reactive Heatmap of President Barack Obama

From the above heat map, we can see that President Obama used the word “tax” 31 times in his 2012 speech, while the heat map shows no mentions in any other year of his tenure. Further, he keeps mentioning “job” from 2010 to 2015, whereas the word does not appear in his 2009 and 2016 speeches.

Reactive Heatmap of President George W. Bush

The heat map shows that President Bush used the word “tax” around 20 times in 2001 and 2004, with no mentions shown in other years. It is not surprising that he used “Iraq” 21 times in his 2007 speech, but it is interesting that he also used “health” 18 times in that same speech.

Presidential Year

The year of the speech was another potentially interesting area to look into. Presidents of the United States have faced many different challenges throughout American history, and these could produce significant patterns in the words they choose for their speeches.

The legend below shows the color of the points based on year:
image image


In this image we can see much better groupings than for Political Party and Presidential Rating. This is most likely due to the trends, problems and changes in language over time, which are reflected in the words the presidents have chosen.

To get an idea of which words might be driving this change, we can look at an importance plot generated with Random Forests trained to predict the year of a speech from the words used in it. Many of the important words point to some notion of time, either through the word itself or through how dated its usage is. For instance, addressing Congress as “Gentleman” would be incorrect today because Congress is made up of men and women, while in its earlier history it was made up only of men. We can also see “Korea”, which could allow the Random Forests to identify presidents who led the country during the Korean War.

image

We also wanted to investigate this in three dimensions, to see whether any more patterns could be visualized in the time dimension of the speeches. t-SNE was no longer included because no visible patterns were seen in the two-dimensional plots.

The darker the color, the older the speech.

Pan to zoom, click and drag to rotate

image


As before, we can see some clustering by year of the speech. The MDS embedding is spread out enough that we also plotted it on its own, labeling each speech with the name of its president.

The darker the color, the older the speech.

Pan to zoom, click and drag to rotate

image


We can see that the more recent presidents, Obama, Bush and Clinton, are very close to each other across all of their speeches. Interestingly, they are closer to the earliest presidents (Jefferson, Washington, Adams, etc.) than to many of the other presidents throughout history.

The outliers of the plot are also interesting: Truman, Taft, Carter and Polk lie very far from the large cluster of presidents. Our research into these presidents found nothing in particular that sets them apart from the others. Some notable facts about them, though, are that Truman was president during World War II, Taft later became Chief Justice of the Supreme Court, Polk based his campaign on a promise not to run for re-election, and Carter negotiated peace talks between Egypt and Israel.

We conclude this analysis by investigating the relationship between poor economic conditions and the sentiment of the speech. We looked at all of the speeches from 1857 to the present (the period for which we have macroeconomic data for the United States) and determined which of these speeches were given during economic recessions.

image

Much like our sentiment analysis of speech by political party, we see that almost all speeches contain many more positive words than negative ones. These two analyses together are indicative of the unrelentingly positive tone of presidential speeches.

Presidential Rating

We also wanted to investigate how the speeches relate to overall presidential rating. It could be that presidents who are doing well have a positive outlook on the country and deliver a more positive message. We computed the quantiles of the presidential ratings and then grouped the presidents by percentile range, as indicated by the legend.
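A grouping of this kind can be sketched with the standard library as follows. The use of quartiles, the function `quartile_groups` and the ratings for presidents "A" through "H" are assumptions for illustration; the actual cut points used for the legend are not given in the text.

```python
import bisect
import statistics


def quartile_groups(ratings):
    """Map each name to a quartile index: 0 = lowest-rated quarter, 3 = highest."""
    cuts = statistics.quantiles(list(ratings.values()), n=4)  # three cut points
    return {name: bisect.bisect_right(cuts, score) for name, score in ratings.items()}


# Hypothetical rating points for eight presidents.
ratings = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60, "G": 70, "H": 80}
groups = quartile_groups(ratings)
print(groups)
```

Since every speech by the same president inherits that president's group, presidents with many speeches contribute many points of the same color to the plots.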

image


In PCA and MDS there does seem to be a tight grouping of the presidents in the +50% group and, to a lesser extent, of those in the -50% group. t-SNE again does not show any significant pattern.

We then look at the relationship between Sentiment Score and Rating Points.


Here we do not see an obvious relationship; the variables are only weakly correlated, with \(\rho\) = -0.180. Based on this result and the results of the PCA, MDS and t-SNE analyses, we believe that any relationship that may exist between Rating Points and the nature of a State of the Union address would be due to more complex features than the simple difference between the numbers of positive and negative sentiment words.
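A correlation coefficient of this kind can be computed as in the sketch below, which assumes that \(\rho\) denotes the Pearson correlation (the text does not say whether Pearson or a rank correlation was used). The function name and the toy values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Toy sentiment scores and rating points (hypothetical values).
sentiment = [1, 2, 3, 4]
rating = [1, 3, 2, 4]
print(pearson(sentiment, rating))
```

A value near -0.18 explains only a small share of the variance (roughly 3% when squared), which is consistent with the absence of a visible trend.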

Decode: Top 10 frequent words

In the previous part, we confirmed that some presidents are clustered while others are not. Visualizing the top 10 words from each president’s speeches in a heat map shows that some words appear frequently across the years. For instance, ‘will’, ‘govern’ and ‘nation’ are high-frequency words throughout, while ‘work’, ‘job’ and ‘secure’ are new top words that appear in President Barack Obama’s speeches.

(Each cell of the heat map shows the president, the word and its frequency. The x-axis does not show all words because of the plot size, but readers can see the detailed information by hovering over the heat map.)

Conclusion

Throughout this report we have explored many different methods of understanding textual data. Visualizations based on dimension reduction, sentiment analysis and word counts gave us different perspectives on the information contained in these documents. Some of these approaches, such as dimension reduction and sentiment analysis, also yield new features that could help with a machine learning task such as classification or predicting the year of a speech.